Exploratory analysis of the Algae dataset

We load the data and take a first look at it:

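library(Hmisc)     # provides describe()
library(corrplot)  # provides corrplot()

# The file has no header row; missing values are coded as 'XXXXXXX'
algas <- read.table(file = "/home/jared/Proyectos/itam-dm/data/algas/algas.txt",
                    header = FALSE,
                    dec = ".",
                    col.names = c('temporada', 'tamaño', 'velocidad', 'mxPH',
                                  'mnO2', 'Cl', 'NO3', 'NO4', 'oPO4', 'PO4',
                                  'Chla', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'),
                    na.strings = c('XXXXXXX')
)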
head(algas)
##   temporada tamaño velocidad mxPH mnO2     Cl    NO3     NO4    oPO4
## 1    winter  small    medium 8.00  9.8 60.800  6.238 578.000 105.000
## 2    spring  small    medium 8.35  8.0 57.750  1.288 370.000 428.750
## 3    autumn  small    medium 8.10 11.4 40.020  5.330 346.667 125.667
## 4    spring  small    medium 8.07  4.8 77.364  2.302  98.182  61.182
## 5    autumn  small    medium 8.06  9.0 55.350 10.416 233.700  58.222
## 6    winter  small      high 8.25 13.1 65.750  9.248 430.000  18.250
##       PO4 Chla   a1   a2   a3  a4   a5   a6  a7
## 1 170.000 50.0  0.0  0.0  0.0 0.0 34.2  8.3 0.0
## 2 558.750  1.3  1.4  7.6  4.8 1.9  6.7  0.0 2.1
## 3 187.057 15.6  3.3 53.6  1.9 0.0  0.0  0.0 9.7
## 4 138.700  1.4  3.1 41.0 18.9 0.0  1.4  0.0 1.4
## 5  97.580 10.5  9.2  2.9  7.5 0.0  7.5  4.1 1.0
## 6  56.667 28.4 15.1 14.6  1.4 0.0 22.5 12.6 2.9
describe(algas)
## algas 
## 
##  18  Variables      200  Observations
## ---------------------------------------------------------------------------
## temporada 
##       n missing  unique 
##     200       0       4 
## 
## autumn (40, 20%), spring (53, 26%), summer (45, 22%) 
## winter (62, 31%) 
## ---------------------------------------------------------------------------
## tamaño 
##       n missing  unique 
##     200       0       3 
## 
## large (45, 22%), medium (84, 42%), small (71, 36%) 
## ---------------------------------------------------------------------------
## velocidad 
##       n missing  unique 
##     200       0       3 
## 
## high (84, 42%), low (33, 16%), medium (83, 42%) 
## ---------------------------------------------------------------------------
## mxPH 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     199       1      72       1   8.012   7.081   7.340   7.700   8.060 
##     .75     .90     .95 
##   8.400   8.700   8.873 
## 
## lowest : 5.60 5.70 6.40 6.50 6.60, highest: 9.00 9.06 9.10 9.50 9.70 
## ---------------------------------------------------------------------------
## mnO2 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     198       2      88       1   9.118   4.485   5.770   7.725   9.800 
##     .75     .90     .95 
##  10.800  11.700  11.815 
## 
## lowest :  1.5  1.8  3.2  3.3  3.4, highest: 12.5 12.6 12.9 13.1 13.4 
## ---------------------------------------------------------------------------
## Cl 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     190      10     178       1   43.64   3.061   4.970  10.981  32.730 
##     .75     .90     .95 
##  57.823  88.600 130.087 
## 
## lowest :   0.222   0.800   1.170   1.450   1.549
## highest: 173.750 187.183 194.750 208.364 391.500 
## ---------------------------------------------------------------------------
## NO3 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     198       2     192       1   3.282  0.4023  0.6912  1.2960  2.6750 
##     .75     .90     .95 
##  4.4463  6.1916  7.9369 
## 
## lowest :  0.050  0.102  0.130  0.230  0.267
## highest:  9.248  9.715  9.773 10.416 45.650 
## ---------------------------------------------------------------------------
## NO4 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     198       2     179       1   501.3   10.00   15.00   38.33  103.17 
##     .75     .90     .95 
##  226.95  805.33 1922.87 
## 
## lowest :     5.0     5.8     8.0    10.0    10.5
## highest:  4073.3  5738.3  6400.0  8777.6 24064.0 
## ---------------------------------------------------------------------------
## oPO4 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     198       2     173       1   73.59    2.00    3.94   15.70   40.15 
##     .75     .90     .95 
##   99.33  193.21  248.34 
## 
## lowest :   1.000   1.250   1.333   1.625   1.800
## highest: 346.167 412.333 428.750 467.500 564.600 
## ---------------------------------------------------------------------------
## PO4 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     198       2     189       1   137.9   6.455  11.350  41.375 103.285 
##     .75     .90     .95 
## 213.750 286.100 345.650 
## 
## lowest :   1.0   2.5   3.0   4.0   6.0
## highest: 558.8 586.0 607.2 624.7 771.6 
## ---------------------------------------------------------------------------
## Chla 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     188      12     131       1   13.97   0.500   0.800   2.000   5.475 
##     .75     .90     .95 
##  18.308  31.817  61.733 
## 
## lowest :   0.20   0.30   0.40   0.50   0.60
## highest:  88.25  92.67  93.68  98.82 110.46 
## ---------------------------------------------------------------------------
## a1 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     200       0     121    0.99   16.92    0.00    0.00    1.50    6.95 
##     .75     .90     .95 
##   24.80   50.72   64.33 
## 
## lowest :  0.0  1.1  1.2  1.4  1.5, highest: 75.8 81.9 82.7 86.6 89.8 
## ---------------------------------------------------------------------------
## a2 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     200       0      89    0.95   7.458    0.00    0.00    0.00    3.00 
##     .75     .90     .95 
##   11.38   21.50   28.38 
## 
## lowest :  0.0  1.0  1.2  1.4  1.5, highest: 40.7 40.9 41.0 53.6 72.6 
## ---------------------------------------------------------------------------
## a3 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     200       0      79    0.95   4.309   0.000   0.000   0.000   1.550 
##     .75     .90     .95 
##   4.925  13.510  20.275 
## 
## lowest :  0.0  1.0  1.1  1.2  1.4, highest: 24.8 25.3 25.9 35.1 42.8 
## ---------------------------------------------------------------------------
## a4 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     200       0      50    0.84   1.992   0.000   0.000   0.000   0.000 
##     .75     .90     .95 
##   2.400   5.000   7.605 
## 
## lowest :  0.0  1.0  1.1  1.2  1.3, highest: 11.5 12.7 13.4 28.8 44.6 
## ---------------------------------------------------------------------------
## a5 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     200       0      81    0.94   5.064    0.00    0.00    0.00    1.90 
##     .75     .90     .95 
##    7.50   14.91   20.04 
## 
## lowest :  0.0  1.0  1.1  1.2  1.4, highest: 28.8 34.2 34.3 35.6 44.4 
## ---------------------------------------------------------------------------
## a6 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     200       0      76    0.85   5.964   0.000   0.000   0.000   0.000 
##     .75     .90     .95 
##   6.925  17.110  31.815 
## 
## lowest :  0.0  1.0  1.2  1.4  1.5, highest: 42.7 49.4 52.5 64.6 77.6 
## ---------------------------------------------------------------------------
## a7 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     200       0      51    0.88   2.496    0.00    0.00    0.00    1.00 
##     .75     .90     .95 
##    2.40    6.10   10.88 
## 
## lowest :  0.0  1.0  1.1  1.2  1.4, highest: 22.1 25.6 30.1 31.2 31.6 
## ---------------------------------------------------------------------------
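The `missing` counts that describe() reports per variable can also be collected into a single vector, which is convenient before deciding how to treat them. A minimal sketch on a toy data frame (`toy` and its values are made up for illustration and are not taken from algas.txt):

```r
# Toy data frame standing in for `algas` (illustrative values only)
toy <- data.frame(
  temporada = c("winter", "spring", "autumn", "spring"),
  mxPH      = c(8.00, NA, 8.10, 8.07),
  Cl        = c(60.8, NA, NA, 77.4)
)

# Number of missing values in each column
missing_per_col <- colSums(is.na(toy))
print(missing_per_col)

# Share of rows that have at least one missing value
share_incomplete <- mean(!complete.cases(toy))
print(share_incomplete)
```

Running the same two calls on `algas` should reproduce the `missing` column of the describe() output above.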

Now, using the function we wrote in utils.r, we explore the data visually to better understand how each variable is distributed.

for (i in seq_along(algas)) {
  print(graf_expl(algas, names(algas)[i]))
}
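The definition of graf_expl() lives in utils.r and is not shown here. Purely as an illustration of what a helper with this call signature might do, the sketch below draws a histogram for numeric columns and a bar chart for categorical ones. The name `graf_expl_sketch`, the use of ggplot2, and the choice of plots are all assumptions, not the author's implementation.

```r
library(ggplot2)

# Hypothetical stand-in for graf_expl(); the real function in utils.r may differ
graf_expl_sketch <- function(df, col) {
  p <- ggplot(df, aes(x = .data[[col]])) + ggtitle(col)
  if (is.numeric(df[[col]])) {
    # Numeric variable: histogram (rows with NA are dropped silently)
    p + geom_histogram(bins = 30, na.rm = TRUE)
  } else {
    # Categorical variable: one bar per level
    p + geom_bar()
  }
}

# Toy data with one categorical and one numeric column (illustrative values)
toy <- data.frame(
  temporada = c("winter", "spring", "autumn", "spring"),
  mxPH      = c(8.00, 8.35, NA, 8.07)
)

# Build one plot per column; print() each element to display it
plots <- lapply(names(toy), function(nm) graf_expl_sketch(toy, nm))
```

Returning the plot object rather than drawing it directly is what makes the explicit print() inside a for loop necessary.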

Next we look at the pairwise relationships between the variables.

for (i in seq_len(ncol(algas) - 1)) {
  for (e in (i + 1):ncol(algas)) {
    print(graf_expl2(algas, names(algas)[i], names(algas)[e]))
  }
}
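The nested loop visits every unordered pair of columns exactly once, so with 18 variables it produces choose(18, 2) = 153 plots. The same enumeration can be written with combn(), as in this sketch on a handful of column names (the subset `vars` is chosen only to keep the example short):

```r
# A few of the column names, to keep the illustration small
vars <- c("temporada", "mxPH", "Cl", "Chla")

# All unordered pairs, in the same order the nested loop would visit them
var_pairs <- combn(vars, 2, simplify = FALSE)

for (p in var_pairs) {
  cat(p[1], "vs", p[2], "\n")
}

# Number of pairwise plots for the full 18-column data set
n_plots <- choose(18, 2)
print(n_plots)
```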

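Here we look at how missing values co-occur across variables by correlating the missingness indicators. There is a clear relationship between missing values in Chla and Cl, and a weaker one among Cl, NO3, NO4, oPO4, PO4 and Chla. Columns without any missing values have a constant indicator, so their correlations are undefined (NA); we set those to 0 before plotting.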

corr_na<-cor(is.na(algas))
corr_na[is.na(corr_na)]<-0
corrplot(corr_na)
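To see why the NA replacement step is needed, the following sketch repeats the computation on a toy data frame in which one column has no missing values (the data and the name `toy` are invented for illustration):

```r
toy <- data.frame(
  a1   = c(1.2, 0.0, 3.4, 5.6, 7.8),     # complete column
  Cl   = c(NA, 10.1, NA, 33.0, 41.2),
  Chla = c(NA, 2.5, NA, 7.1, NA)
)

# Logical matrix: TRUE where a value is missing
na_ind <- is.na(toy)

# The indicator for a1 is constant, so its standard deviation is zero
# and cor() warns and returns NA for its correlations with other columns
corr_raw <- suppressWarnings(cor(na_ind))

# Same replacement as in the analysis: treat undefined correlations as 0
corr_toy <- corr_raw
corr_toy[is.na(corr_toy)] <- 0
print(round(corr_toy, 2))
```

The positive entry for the Cl/Chla pair reflects that the two variables tend to be missing in the same rows, which is the kind of pattern the corrplot makes visible.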

We write a function that replaces NAs with the mean when the variable is numeric and with the mode when it is categorical.

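sust_na <- function(col) {
  if (is.numeric(col)) {
    # Numeric column: replace NAs with the mean of the observed values
    media <- mean(col, na.rm = TRUE)
    col[is.na(col)] <- media
  } else {
    # Any other column: replace NAs with the most frequent value.
    # Note: NA is counted as a value here, so the mode can itself be NA.
    Mode <- function(x) {
      ux <- unique(x)
      ux[which.max(tabulate(match(x, ux)))]
    }
    moda <- Mode(col)
    col[is.na(col)] <- moda
  }
  col
}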
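One detail of the mode helper is worth illustrating: because unique() and match() treat NA as an ordinary value, a column in which NA is more frequent than any single observed value gets NA as its "mode", and the substitution then changes nothing. The sketch below contrasts that behaviour with a variant that drops NAs first. Both function names are hypothetical and the vector `x` is invented for the example.

```r
# Same logic as the Mode() helper inside sust_na()
mode_with_na <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Variant (an assumption, not the author's code) that ignores NAs
mode_no_na <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# NA appears three times, "small" twice, "large" once
x <- c("small", NA, NA, NA, "small", "large")

print(mode_with_na(x))  # NA is the most frequent "value"
print(mode_no_na(x))    # most frequent observed value
```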

Now we apply this function to all columns:

algas_2<-apply(algas,2,sust_na)

We check that no NAs remain:

corr_na_2<-cor(is.na(algas_2))
corr_na_2[is.na(corr_na_2)]<-0
corrplot(corr_na_2)

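Since the function had no effect on two of the columns, we apply it with a for loop instead. (apply() first converts the data frame to a character matrix, so sust_na() no longer sees any numeric column.)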

algas_3<-algas
for (i in 1:ncol(algas)){
  algas_3[,i]<-sust_na(algas[,i])
}
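The difference between the two attempts comes from how apply() treats a data frame: it converts it to a matrix first, and a matrix built from mixed column types is a character matrix. A small sketch on invented data (`toy` is not part of the original analysis):

```r
toy <- data.frame(
  temporada = c("winter", "spring", "autumn"),
  mxPH      = c(8.00, NA, 8.10)
)

# apply() works on as.matrix(toy), which is character for mixed types,
# so no column is seen as numeric
via_apply <- apply(toy, 2, is.numeric)
print(via_apply)

# sapply() iterates over the actual columns and keeps their types
via_sapply <- sapply(toy, is.numeric)
print(via_sapply)
```

Indexing the data frame column by column, as the for loop does, passes each column to sust_na() with its original type.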

Now we see that the NAs were indeed removed, which concludes our analysis of missing values.

corr_na_3<-cor(is.na(algas_3))
corr_na_3[is.na(corr_na_3)]<-0
corrplot(corr_na_3)
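An empty correlation plot is an indirect check. As a complement, the sketch below imputes a toy data frame column by column with lapply(), which is an alternative to the for loop that also preserves column types, and then confirms numerically that nothing is missing. The helper `impute_mean` and the data are assumptions made for the example and only cover the numeric case.

```r
# Hypothetical helper: mean imputation for numeric columns, others untouched
impute_mean <- function(col) {
  if (is.numeric(col)) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
  }
  col
}

toy <- data.frame(
  temporada = c("winter", "spring", "autumn"),
  mxPH      = c(8.0, NA, 8.2)
)

# Assigning into toy_imp[] keeps the data frame structure and column types
toy_imp <- toy
toy_imp[] <- lapply(toy, impute_mean)

# Direct checks: no NAs left, and mxPH is still numeric
n_missing <- sum(is.na(toy_imp))
print(n_missing)
print(sapply(toy_imp, class))
```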